[train] Updates to support `xgboost==2.1.0` #46667

justinvyu · 2024-07-16T20:26:54Z

Why are these changes needed?

xgboost 2.1.0 was recently released, and it changed some of the distributed setup APIs.

In particular:

- CollectiveCommunicator was changed to not read from environment variables. Instead, an argument dict is required to be passed in: dmlc/xgboost@a5a5810#diff-a74bc610352aa00eda4ae89c1f3a51c33b934f50f055f366def49654efb42992
- The RabitTracker API was updated with a few renamed methods and changed behavior:
  - worker_envs() -> worker_args()
  - RabitTracker.wait_for must be run as a separate thread in order for worker_args to return properly: dmlc/xgboost@a5a5810#diff-94301e6ca68aefc564a0d617db4ab2de3425b2cca9d66fe95ad8c7ce97399c14R182-R186
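
A minimal sketch of the threading pattern described in the last bullet. `StandInTracker` is a hypothetical stand-in written for this example, not xgboost's `RabitTracker`; only the method names `start`, `wait_for`, and `worker_args` come from the description above, while the returned keys, the port number, and the blocking behavior are assumptions made for illustration.

```python
import threading


class StandInTracker:
    """Hypothetical stand-in for a tracker; NOT xgboost's RabitTracker."""

    def __init__(self, host_ip: str, n_workers: int) -> None:
        self.host_ip = host_ip
        self.n_workers = n_workers
        self._serving = threading.Event()
        self._finished = threading.Event()

    def start(self) -> None:
        # A real tracker would bind a socket here; the stand-in does nothing.
        pass

    def wait_for(self, timeout: float = 5.0) -> None:
        # Blocks until the job is finished, so it must not run on the caller's thread.
        self._serving.set()
        self._finished.wait(timeout)

    def worker_args(self) -> dict:
        # Toy model of the described behavior: only usable while wait_for is running.
        if not self._serving.wait(timeout=2.0):
            raise TimeoutError("wait_for is not running in another thread")
        # Illustrative keys and port, assumed for this sketch.
        return {"dmlc_tracker_uri": self.host_ip, "dmlc_tracker_port": 9091}

    def finish(self) -> None:
        self._finished.set()


def launch_tracker(tracker: StandInTracker):
    """Start the tracker, run wait_for in a daemon thread, then fetch the worker args."""
    tracker.start()
    waiter = threading.Thread(target=tracker.wait_for, daemon=True)
    waiter.start()
    return tracker.worker_args(), waiter


tracker = StandInTracker("127.0.0.1", n_workers=2)
args, waiter = launch_tracker(tracker)
tracker.finish()          # stands in for "training is done"
waiter.join(timeout=5.0)  # shutdown: wait for the background thread to exit
```

Keeping a handle on the thread is what lets a shutdown hook join it later, instead of relying on the tracker to own a thread itself.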

This PR branches the setup logic between pre-2.1.0 and post-2.1.0 versions of xgboost. We should eventually drop pre-2.1.0 support.
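
One way to express such a version branch, sketched with only the standard library. The helper names `parse_release` and `needs_explicit_communicator_args` are hypothetical and not taken from the PR; the actual change may compare versions differently (for example with `packaging.version`).

```python
import re
from typing import Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.1.0rc1' -> (2, 1, 0)."""
    match = re.match(r"\d+(\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_explicit_communicator_args(version: str) -> bool:
    """True when the installed xgboost is 2.1 or newer (hypothetical helper)."""
    return parse_release(version) >= (2, 1)


# In real code the string would come from xgboost.__version__.
for candidate in ("1.7.6", "2.0.3", "2.1.0"):
    branch = "args-based setup" if needs_explicit_communicator_args(candidate) else "env-based setup"
    print(candidate, "->", branch)
```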

Testing

This PR also updates the tested xgboost version to 2.1.0. Pre-2.1.0 has been tested manually.

Related issue number

Closes #46476

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Yu <justinvyu@anyscale.com>


hongpeng-guo

Thanks for the implementation! Two questions from me:

The RabitTracker methods worker_args and worker_envs return the same type, Dict[str, Union[int, str]]. The only difference is that the keys of worker_envs are uppercase while the keys of worker_args are lowercase. Our adaptation to this change is to move from environment-variable setup to training-context setup, is that correct?
The RabitTracker class doesn't maintain a thread itself; instead, we need to create a thread ourselves that runs its wait_for method after tracker.start. My question is: do we also need to distinguish between xgboost versions before and after 2.1.0 in the on_shutdown method? My first intuition is to use tracker.thread before 2.1.0 and wait_for from 2.1.0 onward. Currently, it seems we always use the wait_for method.

python/ray/train/xgboost/config.py


justinvyu · 2024-07-25T21:16:40Z

> The RabitTracker methods worker_args and worker_envs return the same type, Dict[str, Union[int, str]]. The only difference is that the keys of worker_envs are uppercase while the keys of worker_args are lowercase. Our adaptation to this change is to move from environment-variable setup to training-context setup, is that correct?

Yes. The API changed from accepting environment variables to only allowing you to pass the arguments directly as kwargs with those lower-case names.
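
A small sketch of the adaptation discussed here, assuming (as stated in the question above) that the two dictionaries differ only in key case. `envs_to_args` and `stand_in_communicator` are hypothetical names for illustration; the stand-in context manager only records the keyword arguments it receives and is not xgboost's communicator, and the key names and values are assumed examples.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Union

Args = Dict[str, Union[int, str]]


def envs_to_args(worker_envs: Args) -> Args:
    """Map upper-case, env-style keys to the lower-case keyword names."""
    return {key.lower(): value for key, value in worker_envs.items()}


@contextmanager
def stand_in_communicator(**args: Union[int, str]) -> Iterator[Args]:
    # Hypothetical stand-in: a real communicator would connect to the tracker here.
    yield dict(args)


# Illustrative values only.
legacy_envs: Args = {"DMLC_TRACKER_URI": "10.0.0.1", "DMLC_TRACKER_PORT": 9091}

# Instead of exporting the values to os.environ, pass them directly as kwargs.
with stand_in_communicator(**envs_to_args(legacy_envs)) as received:
    received_args = received
```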


.buildkite/others.rayci.yml


hongpeng-guo · 2024-08-06T20:19:38Z

python/ray/train/xgboost/config.py

+    def on_training_start(
+        self, worker_group: WorkerGroup, backend_config: XGBoostConfig
+    ):
+        assert backend_config.xgboost_communicator == "rabit"


nit: it seems XGBoostConfig has a hard-coded xgboost_communicator field set to "rabit", so why do we still need an assertion here?

Yeah, I can probably remove this field for now, since we don't support the "federated" option.
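
If the field were kept rather than removed, one alternative to asserting inside on_training_start would be to validate it once at construction time. The sketch below is not what the PR does; `SketchXGBoostConfig` and `SUPPORTED_COMMUNICATORS` are hypothetical names, and only the field name and the "rabit"/"federated" values come from the discussion above.

```python
from dataclasses import dataclass

# Assumption for this sketch: only "rabit" is supported, per the thread above.
SUPPORTED_COMMUNICATORS = ("rabit",)


@dataclass
class SketchXGBoostConfig:
    """Hypothetical config class; not Ray's XGBoostConfig."""

    xgboost_communicator: str = "rabit"

    def __post_init__(self) -> None:
        # Fail early, so later hooks do not need their own assertion.
        if self.xgboost_communicator not in SUPPORTED_COMMUNICATORS:
            raise NotImplementedError(
                f"Unsupported communicator: {self.xgboost_communicator!r}"
            )


default_config = SketchXGBoostConfig()

try:
    SketchXGBoostConfig(xgboost_communicator="federated")
    rejected = False
except NotImplementedError:
    rejected = True
```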

python/requirements/ml/core-requirements.txt

python/requirements_compiled.txt

hongpeng-guo

Left some comments, most of them are nits and should be non-blocking.
It should be good to go if all unit tests look good.

python/ray/train/xgboost/config.py

woshiyyya · 2024-08-06T21:21:12Z

python/ray/train/xgboost/config.py

@@ -37,28 +41,93 @@ class XGBoostConfig(BackendConfig):
    def train_func_context(self):
        @contextmanager
        def collective_communication_context():
-            with CommunicatorContext():
+            with CommunicatorContext(**_get_xgboost_args()):


Are we able to save the xgboost_args into XGBoost config so we can avoid modifying the global variable?

Hmm, interesting. I actually don't understand why we need both BackendConfig and Backend classes. Any context here @matthewdeng ?

The BackendConfig is the public API that the user could interact with. There is probably a better/cleaner way to organize the two.

Yeah, currently the dependency between BackendConfig and Backend is unidirectional. It's kind of hard to pass information from Backend -> BackendConfig.

Should train_func_context be part of the Backend instead?

Or at the very least the default one.

Oh hm maybe that won't work because we construct the train loop before the backend...

woshiyyya

lgtm!

hongpeng-guo

LGTM!


Support xgboost 2.1.0, which was recently released and changed some of the distributed setup APIs.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: Dev <dev.goyal@hinge.co>

justinvyu added 4 commits July 16, 2024 13:15

make compatible for xgboost 2.1.0

386f731

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update xgboost to 2.1.0

d82323c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

make global var naming private

da5eadc

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

small cleanup

49d645c

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu assigned matthewdeng Jul 16, 2024

justinvyu requested review from hongpeng-guo, matthewdeng, raulchen and woshiyyya as code owners July 16, 2024 20:26

justinvyu added 5 commits July 17, 2024 14:25

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

e6297c4

…10compat

update req-compiled

f97bbf4

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

661d6ad

…10compat Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update requirements compiled

2170aa3

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

6ea9533

…10compat Signed-off-by: Justin Yu <justinvyu@anyscale.com>

hongpeng-guo reviewed Jul 25, 2024

python/ray/train/xgboost/config.py

justinvyu added 3 commits July 25, 2024 13:29

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

dc3c878

…10compat

[TEMP] remove ci dep for pip compile to run

030f415

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update req compiled

2a83490

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

justinvyu added 5 commits August 1, 2024 14:52

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

7ad42d4

…10compat

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

a7ac8a0

…10compat

separate into 2 different classes

b09df85

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

TEMP: add nvidia nccl dep

8518fee

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

update req compiled

3a9fed6

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

hongpeng-guo reviewed Aug 6, 2024

.buildkite/others.rayci.yml

justinvyu added 2 commits August 6, 2024 13:14

revert TEMP

fd9ad1a

Signed-off-by: Justin Yu <justinvyu@anyscale.com>

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

5053ef8

…10compat

hongpeng-guo reviewed Aug 6, 2024

python/requirements/ml/core-requirements.txt

hongpeng-guo reviewed Aug 6, 2024

python/requirements_compiled.txt

hongpeng-guo reviewed Aug 6, 2024


woshiyyya reviewed Aug 6, 2024


woshiyyya approved these changes Aug 7, 2024


hongpeng-guo approved these changes Aug 7, 2024


justinvyu enabled auto-merge (squash) August 7, 2024 22:28

github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Aug 7, 2024

Merge branch 'master' of https://github.com/ray-project/ray into xgb2…

f40de02

…10compat

github-actions bot disabled auto-merge August 8, 2024 00:33

justinvyu merged commit c634872 into ray-project:master Aug 8, 2024
5 checks passed

justinvyu deleted the xgb210compat branch August 8, 2024 18:48

justinvyu mentioned this pull request Oct 7, 2024

Ray Train incompatible broken with XGBoost 2.1.0 #46476

Closed
